{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Judgments and Feature Logging\n",
    "\n",
    "In this notebook, we cover the first two steps of Learning to Rank. First, we grade some documents as relevant or irrelevant for given queries; we call these grades _judgments_. Second, we retrieve some _features_: metadata about each graded document in our judgments. We call the process of extracting document features _feature logging_.\n",
    "\n",
    "NOTE: This notebook depends on the TheMovieDB dataset. If you have any issues, please rerun the [Setting up TheMovieDB notebook](1.setup-the-movie-db.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "import sys\n",
    "import IPython.display as ipdisplay\n",
    "sys.path.append('../..')\n",
    "from aips import *\n",
    "engine = get_engine()\n",
    "tmdb_collection = engine.get_collection(\"tmdb\")\n",
    "ltr = get_ltr_engine(tmdb_collection)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Omitted from book\n",
    "A single judgment, grading document 37799 (\"The Social Network\") as relevant (`grade=1`) for the search query string `social network`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ltr.judgments import Judgment\n",
    "\n",
    "Judgment(grade=1, keywords='social network', doc_id=37799)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 10.3\n",
    "\n",
    "A slightly larger judgment list. Here two query strings are graded: `social network` and `star wars`. For `social network`, one movie is graded as relevant and three as irrelevant. For `star wars`, two movies are graded as relevant and three as irrelevant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_judgments = [\n",
    "    # for 'social network' query\n",
    "    Judgment(1, 'social network', '37799'),  # The Social Network\n",
    "    Judgment(0, 'social network', '267752'), # #chicagoGirl\n",
    "    Judgment(0, 'social network', '38408'),  # Life As We Know It\n",
    "    Judgment(0, 'social network', '28303'),  # The Cheyenne Social Club\n",
    "    \n",
    "    # for 'star wars' query\n",
    "    Judgment(1, 'star wars', '11'),     # Star Wars\n",
    "    Judgment(1, 'star wars', '1892'),   # Return of the Jedi\n",
    "    Judgment(0, 'star wars', '54138'),  # Star Trek Into Darkness\n",
    "    Judgment(0, 'star wars', '85783'),  # The Star\n",
    "    Judgment(0, 'star wars', '325553')  # Battlestar Galactica\n",
    "]\n",
    "sample_judgments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, none of our judgments have any features yet. We confirm this by checking the first one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_judgments[0].features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 10.4\n",
    "\n",
    "Create a feature set with three features. The first retrieves the relevance score of the search string in the `title` field (hence `title:(${keywords})`). The second does the same for the `overview` field. The third is simply the `release_year` of the movie.\n",
    "\n",
    "We create a feature store named `movie_model` in our search engine. We'll use the feature store name as a handle when we log features later in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'name': 'title_bm25',\n",
       "  'class': 'org.apache.solr.ltr.feature.SolrFeature',\n",
       "  'params': {'q': 'title:(${keywords})'},\n",
       "  'store': 'movie_model'},\n",
       " {'name': 'overview_bm25',\n",
       "  'class': 'org.apache.solr.ltr.feature.SolrFeature',\n",
       "  'params': {'q': 'overview:(${keywords})'},\n",
       "  'store': 'movie_model'},\n",
       " {'name': 'release_year',\n",
       "  'class': 'org.apache.solr.ltr.feature.SolrFeature',\n",
       "  'params': {'q': '{!func}release_year'},\n",
       "  'store': 'movie_model'}]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ltr.delete_feature_store(\"movie_model\")\n",
    "\n",
    "feature_set = [\n",
    "    ltr.generate_query_feature(feature_name=\"title_bm25\", field_name=\"title\"),\n",
    "    ltr.generate_query_feature(feature_name=\"overview_bm25\", field_name=\"overview\"),\n",
    "    ltr.generate_field_value_feature(feature_name=\"release_year\", field_name=\"release_year\")]\n",
    "\n",
    "ltr.upload_features(features=feature_set, model_name=\"movie_model\")\n",
    "display(feature_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 10.5\n",
    "\n",
    "Recall we have one relevant and three irrelevant movies for `social network`. Here we retrieve all three features created above for each of the four movies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': '38408',\n",
       "  'title': 'Life As We Know It',\n",
       "  '[features]': {'title_bm25': 0.0,\n",
       "   'overview_bm25': 4.353118,\n",
       "   'release_year': 2010.0}},\n",
       " {'id': '37799',\n",
       "  'title': 'The Social Network',\n",
       "  '[features]': {'title_bm25': 8.243603,\n",
       "   'overview_bm25': 3.8143613,\n",
       "   'release_year': 2010.0}},\n",
       " {'id': '267752',\n",
       "  'title': '#chicagoGirl',\n",
       "  '[features]': {'title_bm25': 0.0,\n",
       "   'overview_bm25': 6.0172443,\n",
       "   'release_year': 2013.0}},\n",
       " {'id': '28303',\n",
       "  'title': 'The Cheyenne Social Club',\n",
       "  '[features]': {'title_bm25': 3.4286604,\n",
       "   'overview_bm25': 3.1086721,\n",
       "   'release_year': 1970.0}}]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ids = [\"37799\", \"267752\", \"38408\", \"28303\"]  # documents graded for 'social network'\n",
    "options = {\"keywords\": \"social network\"}\n",
    "response = ltr.get_logged_features(\"movie_model\", ids,\n",
    "                 options=options, fields=[\"id\", \"title\"])\n",
    "display(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Omitted from book - parse features from the search engine response\n",
    "\n",
    "**The following code is used to generate the listings below, but this parsing code is omitted from the book itself.**\n",
    "\n",
    "Each document returned by the search engine carries its logged features in the `[features]` property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def populate_features_for_qid(qid, docs, judgments):\n",
    "    \"\"\"Attach logged feature values to every judgment with the given qid.\"\"\"\n",
    "    doc_id_to_features = {}\n",
    "\n",
    "    # Map doc id => feature values, in the order the engine returned them\n",
    "    for doc in docs:\n",
    "        features = doc[\"[features]\"]\n",
    "        doc_id_to_features[doc[\"id\"]] = list(features.values())\n",
    "\n",
    "    # Store the features on each matching judgment\n",
    "    for judgment in judgments:\n",
    "        if judgment.qid == qid:\n",
    "            try:\n",
    "                judgment.features = doc_id_to_features[judgment.doc_id]\n",
    "            except KeyError:\n",
    "                # No features were logged for this doc; leave the judgment unchanged\n",
    "                pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 10.6 (output)\n",
    "\n",
    "Listing 10.6 is the output of the following: the result of logging features for just the `social network` query. The `star wars` judgments still have no features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[8.243603, 3.8143613, 2010.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 6.0172443, 2013.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 4.353118, 2010.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[3.4286604, 3.1086721, 1970.0],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "populate_features_for_qid(1, response, sample_judgments)\n",
    "sample_judgments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 10.7 (output)\n",
    "\n",
    "Listing 10.7 is the output of the following, which adds features parsed from the `star wars` movies to our judgment list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[8.243603, 3.8143613, 2010.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 6.0172443, 2013.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 4.353118, 2010.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[3.4286604, 3.1086721, 1970.0],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[6.7963624, 0.0, 1977.0],weight=1),\n",
       " Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[0.0, 1.9681965, 1983.0],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[2.444128, 0.0, 2013.0],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[3.1871135, 0.0, 1952.0],weight=1),\n",
       " Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[0.0, 0.0, 2003.0],weight=1)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ids = [\"11\", \"1892\", \"54138\", \"85783\", \"325553\"]  # documents graded for 'star wars'\n",
    "options = {\"keywords\": \"star wars\"}\n",
    "documents = ltr.get_logged_features(\"movie_model\", ids, options=options, fields=[\"id\", \"title\"])\n",
    "populate_features_for_qid(2, documents, sample_judgments)\n",
    "sample_judgments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading / logging training set (omitted from book)\n",
    "\n",
    "The following downloads a larger judgment list, parses it, and logs features for each graded document. It repeats the full logging workflow from this notebook, but in a single loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Duplicate docs in for query id 67: {'2503'}\n",
      "Missing doc 225130 with error\n",
      "Missing doc 37106 with error\n",
      "Duplicate docs in for query id 74: {'11852'}\n",
      "Missing doc 61919 with error\n",
      "Missing doc 67479 with error\n",
      "Missing doc 17882 with error\n",
      "Duplicate docs in for query id 95: {'17431'}\n",
      "Duplicate docs in for query id 98: {'1830'}\n",
      "Parsing QID 100\n",
      "Duplicate docs in for query id 99: {'9799'}\n",
      "Missing doc 61920 with error\n",
      "Sample Logged Features: \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Judgment(grade=1,qid=1,keywords=rambo,doc_id=7555,features=[5.9264307, 5.078817, 2008.0],weight=1),\n",
       " Judgment(grade=1,qid=1,keywords=rambo,doc_id=1370,features=[5.025649, 5.751174, 1988.0],weight=1),\n",
       " Judgment(grade=1,qid=1,keywords=rambo,doc_id=1369,features=[3.4517245, 4.8904457, 1985.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=13258,features=[0.0, 4.5890946, 2007.0],weight=1),\n",
       " Judgment(grade=1,qid=1,keywords=rambo,doc_id=1368,features=[0.0, 5.394124, 1982.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=31362,features=[0.0, 3.7886639, 1988.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=61410,features=[0.0, 2.1226306, 2010.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=319074,features=[0.0, 0.0, 2015.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=10296,features=[0.0, 0.0, 2004.0],weight=1),\n",
       " Judgment(grade=0,qid=1,keywords=rambo,doc_id=35868,features=[0.0, 0.0, 2001.0],weight=1)]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from ltr.log import FeatureLogger\n",
    "from ltr.judgments import judgments_open\n",
    "from itertools import groupby\n",
    "\n",
    "collection = engine.get_collection(\"tmdb\")\n",
    "ftr_logger = FeatureLogger(collection, feature_set=\"movie_model\")\n",
    "\n",
    "with judgments_open(\"data/ai_pow_search_judgments.txt\") as judgments:\n",
    "    for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):\n",
    "        ftr_logger.log_for_qid(query_judgments, qid, judgments.keywords(qid))\n",
    "    \n",
    "print(\"Sample Logged Features: \")\n",
    "display(ftr_logger.logged[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next up, prep data for training\n",
    "\n",
    "We now have a dataset extracted from the search engine. Next, we need to transform the training data. This transformation turns our slightly strange-looking ranking problem into one that looks more like any other boring machine learning problem.\n",
    "\n",
    "Up next: [Feature Normalization and Pairwise Transform](3.pairwise-transform.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
